The English Premier League is one of the most popular domestic football leagues in the world.
Goal: design a predictive model that predicts whether the away team will win a football match.
Keys to result data:
Date=Match Date (dd/mm/yy)
HomeTeam=Home Team
AwayTeam=Away Team
FTHG and HG=Full Time Home Team Goals
FTAG and AG=Full Time Away Team Goals
FTR and Res=Full Time Result (H=Home Win, D=Draw, A=Away Win)
HTHG=Half Time Home Team Goals
HTAG=Half Time Away Team Goals
HTR=Half Time Result (H=Home Win, D=Draw, A=Away Win)
Referee=Match Referee
HS=Home Team Shots
AS=Away Team Shots
HST=Home Team Shots On Target
AST=Away Team Shots On Target
HC=Home Team Corners
AC=Away Team Corners
HF=Home Team Fouls Committed
AF=Away Team Fouls Committed
HY=Home Team Yellow Cards
AY=Away Team Yellow Cards
HR=Home Team Red Cards
AR=Away Team Red Cards
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime as dt
%matplotlib inline
# Read data from csv into a data frame
df=pd.read_csv("results.csv", encoding='latin1')
df.head(3)
| | Season | DateTime | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | HTR | ... | HST | AST | HC | AC | HF | AF | HY | AY | HR | AR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1993-94 | 1993-08-14T00:00:00Z | Arsenal | Coventry | 0 | 3 | A | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1993-94 | 1993-08-14T00:00:00Z | Aston Villa | QPR | 4 | 1 | H | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1993-94 | 1993-08-14T00:00:00Z | Chelsea | Blackburn | 1 | 2 | A | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 rows × 23 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11113 entries, 0 to 11112
Data columns (total 23 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Season    11113 non-null  object
 1   DateTime  11113 non-null  object
 2   HomeTeam  11113 non-null  object
 3   AwayTeam  11113 non-null  object
 4   FTHG      11113 non-null  int64
 5   FTAG      11113 non-null  int64
 6   FTR       11113 non-null  object
 7   HTHG      10189 non-null  float64
 8   HTAG      10189 non-null  float64
 9   HTR       10189 non-null  object
 10  Referee   8289 non-null   object
 11  HS        8289 non-null   float64
 12  AS        8289 non-null   float64
 13  HST       8289 non-null   float64
 14  AST       8289 non-null   float64
 15  HC        8289 non-null   float64
 16  AC        8289 non-null   float64
 17  HF        8289 non-null   float64
 18  AF        8289 non-null   float64
 19  HY        8289 non-null   float64
 20  AY        8289 non-null   float64
 21  HR        8289 non-null   float64
 22  AR        8289 non-null   float64
dtypes: float64(14), int64(2), object(7)
memory usage: 2.0+ MB
#visualise the proportion of missing data
plt.figure(figsize=(15,8))
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
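The heatmap shows where values are missing; the same information can be put into numbers. From the `df.info()` output, `Referee` and the match statistics have 8289 of 11113 values (about 25% missing), and the half-time columns have 10189 (about 8% missing). The sketch below shows one way to compute such shares; the `toy` frame and its values are made up for illustration and are not taken from `results.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the results data (values are invented for illustration)
toy = pd.DataFrame({
    "FTHG": [0, 4, 1, 2],
    "HTHG": [np.nan, np.nan, 1.0, 0.0],
    "HS": [np.nan, np.nan, np.nan, 12.0],
})

# isnull() gives a boolean frame; the column mean of booleans is the missing share
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data the equivalent call would be `df.isnull().mean()`.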
# Fill missing values: column mean for numeric stats, most frequent value for categorical columns
# (assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about)
numeric_cols=['HTHG','HTAG','HS','AS','HST','AST','HC','AC','HF','AF','HY','AY','HR','AR']
df[numeric_cols]=df[numeric_cols].fillna(df[numeric_cols].mean())
for col in ['HTR','Referee']:
    df[col]=df[col].fillna(df[col].mode()[0])
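The imputation above computes each fill value over all rows, before any train/test split, and afterwards an imputed value cannot be told apart from an observed one. A possible alternative, sketched here under the assumption that scikit-learn's `SimpleImputer` is acceptable for this notebook, is to learn the means from the training rows only and keep an indicator column that records which values were filled. The arrays `X_train_toy` and `X_test_toy` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric features with gaps (two columns, e.g. shots and corners)
X_train_toy = np.array([[1.0, 10.0],
                        [np.nan, 12.0],
                        [3.0, np.nan]])
X_test_toy = np.array([[np.nan, 8.0]])

# Means are learned from the training rows only; add_indicator appends
# one 0/1 column per feature that had missing values during fit
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_train_imp = imputer.fit_transform(X_train_toy)
X_test_imp = imputer.transform(X_test_toy)

print(X_test_imp)
```

The test row's missing first feature is filled with the training mean of that column, and the appended indicator marks it as imputed.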
# Check whether any missing values remain
plt.figure(figsize=(15,8))
sns.heatmap(df.isnull(),yticklabels=False,cmap='viridis')
<AxesSubplot:>
# Binarise the target: 'A' for an away win, 'NA' (not away) for a home win or draw
def convert(result):
    if result == 'A':
        return 'A'
    else:
        return 'NA'
df['FTR']=df['FTR'].apply(convert)
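The same relabelling can be written without a row-by-row `apply`. This is a minimal sketch on a hypothetical four-row series; `binary` reproduces the 'A'/'NA' labels and `away_win` is an optional 0/1 version of the same target.

```python
import numpy as np
import pandas as pd

# Hypothetical full-time results
ftr = pd.Series(["A", "H", "D", "A"], name="FTR")

# Vectorised equivalent of convert(): keep 'A', map everything else to 'NA'
binary = pd.Series(np.where(ftr == "A", "A", "NA"), index=ftr.index, name="FTR")

# Optional numeric target: 1 for an away win, 0 otherwise
away_win = (ftr == "A").astype(int)

print(binary.tolist(), away_win.tolist())
```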
htr=pd.get_dummies(df['HTR'],drop_first=True)
df.drop(['HTR'],axis=1,inplace=True)
df.head()
| | Season | DateTime | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | Referee | ... | HST | AST | HC | AC | HF | AF | HY | AY | HR | AR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1993-94 | 1993-08-14T00:00:00Z | Arsenal | Coventry | 0 | 3 | A | 0.684758 | 0.501031 | M Dean | ... | 6.117264 | 4.768247 | 6.081795 | 4.784292 | 11.379057 | 11.873447 | 1.415852 | 1.746532 | 0.062854 | 0.089396 |
| 1 | 1993-94 | 1993-08-14T00:00:00Z | Aston Villa | QPR | 4 | 1 | NA | 0.684758 | 0.501031 | M Dean | ... | 6.117264 | 4.768247 | 6.081795 | 4.784292 | 11.379057 | 11.873447 | 1.415852 | 1.746532 | 0.062854 | 0.089396 |
| 2 | 1993-94 | 1993-08-14T00:00:00Z | Chelsea | Blackburn | 1 | 2 | A | 0.684758 | 0.501031 | M Dean | ... | 6.117264 | 4.768247 | 6.081795 | 4.784292 | 11.379057 | 11.873447 | 1.415852 | 1.746532 | 0.062854 | 0.089396 |
| 3 | 1993-94 | 1993-08-14T00:00:00Z | Liverpool | Sheffield Weds | 2 | 0 | NA | 0.684758 | 0.501031 | M Dean | ... | 6.117264 | 4.768247 | 6.081795 | 4.784292 | 11.379057 | 11.873447 | 1.415852 | 1.746532 | 0.062854 | 0.089396 |
| 4 | 1993-94 | 1993-08-14T00:00:00Z | Man City | Leeds | 1 | 1 | NA | 0.684758 | 0.501031 | M Dean | ... | 6.117264 | 4.768247 | 6.081795 | 4.784292 | 11.379057 | 11.873447 | 1.415852 | 1.746532 | 0.062854 | 0.089396 |
5 rows × 22 columns
df['HomeTeam'].describe()
count       11113
unique         50
top       Arsenal
freq          552
Name: HomeTeam, dtype: object
df['HomeTeam'].value_counts()
Arsenal             552
Man United          552
Liverpool           551
Everton             551
Tottenham           551
Chelsea             550
Newcastle           513
West Ham            495
Aston Villa         494
Man City            456
Southampton         419
Blackburn           327
Sunderland          304
Leicester           302
Fulham              285
Middlesbrough       266
Leeds               248
West Brom           247
Bolton              247
Crystal Palace      227
Stoke               190
Norwich             172
Coventry            156
Wigan               152
Charlton            152
Wolves              149
Burnley             148
Watford             148
Sheffield Weds      137
Wimbledon           137
Birmingham          133
Swansea             133
Derby               133
Portsmouth          133
QPR                 118
Hull                 95
Bournemouth          95
Brighton             92
Ipswich              80
Sheffield United     78
Nott'm Forest        78
Reading              57
Bradford             38
Cardiff              38
Huddersfield         38
Oldham               21
Swindon              21
Barnsley             19
Blackpool            19
Brentford            16
Name: HomeTeam, dtype: int64
from sklearn.compose import make_column_selector as selector
categorical_columns_selector=selector(dtype_include=object)
categorical_columns=categorical_columns_selector(df)
categorical_columns
['Season', 'DateTime', 'HomeTeam', 'AwayTeam', 'FTR', 'Referee']
df_categorical=df[categorical_columns]
df_categorical.head()
| | Season | DateTime | HomeTeam | AwayTeam | FTR | Referee |
|---|---|---|---|---|---|---|
| 0 | 1993-94 | 1993-08-14T00:00:00Z | Arsenal | Coventry | A | M Dean |
| 1 | 1993-94 | 1993-08-14T00:00:00Z | Aston Villa | QPR | NA | M Dean |
| 2 | 1993-94 | 1993-08-14T00:00:00Z | Chelsea | Blackburn | A | M Dean |
| 3 | 1993-94 | 1993-08-14T00:00:00Z | Liverpool | Sheffield Weds | NA | M Dean |
| 4 | 1993-94 | 1993-08-14T00:00:00Z | Man City | Leeds | NA | M Dean |
from sklearn.preprocessing import OrdinalEncoder
df_categorical_columns=df_categorical[["HomeTeam","AwayTeam","Referee","DateTime","Season"]]
encoder=OrdinalEncoder().set_output(transform="pandas")
df_encoded=encoder.fit_transform(df_categorical_columns)
df_encoded
| | HomeTeam | AwayTeam | Referee | DateTime | Season |
|---|---|---|---|---|---|
| 0 | 0.0 | 15.0 | 75.0 | 0.0 | 0.0 |
| 1 | 1.0 | 34.0 | 75.0 | 0.0 | 0.0 |
| 2 | 14.0 | 4.0 | 75.0 | 0.0 | 0.0 |
| 3 | 25.0 | 37.0 | 75.0 | 0.0 | 0.0 |
| 4 | 26.0 | 23.0 | 75.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 11108 | 1.0 | 43.0 | 50.0 | 3440.0 | 28.0 |
| 11109 | 9.0 | 46.0 | 73.0 | 3441.0 | 28.0 |
| 11110 | 24.0 | 16.0 | 115.0 | 3441.0 | 28.0 |
| 11111 | 30.0 | 11.0 | 79.0 | 3441.0 | 28.0 |
| 11112 | 26.0 | 25.0 | 4.0 | 3442.0 | 28.0 |
11113 rows × 5 columns
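`OrdinalEncoder` numbers the categories in sorted order, which is why Arsenal becomes 0.0 and Aston Villa 1.0 in the output above. A model that treats these codes as magnitudes sees an ordering between teams that is purely alphabetical. The sketch below contrasts this with one-hot encoding on a hypothetical four-row frame; whether one-hot columns suit this dataset (50 teams) is a judgment call that the notebook does not settle.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical team column
teams = pd.DataFrame({"HomeTeam": ["Chelsea", "Arsenal", "Liverpool", "Arsenal"]})

# Ordinal codes follow the sorted category order: Arsenal=0, Chelsea=1, Liverpool=2
ordinal = OrdinalEncoder().fit_transform(teams)

# One-hot: one 0/1 column per team, no implied order between teams
onehot_encoder = OneHotEncoder(handle_unknown="ignore")
onehot = onehot_encoder.fit_transform(teams).toarray()

print(ordinal.ravel())
print(onehot)
```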
df.drop(['HomeTeam','AwayTeam','Referee','DateTime','Season'],axis=1,inplace=True)
df2=pd.concat([df,df_encoded,htr],axis=1)
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11113 entries, 0 to 11112
Data columns (total 24 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   FTHG      11113 non-null  int64
 1   FTAG      11113 non-null  int64
 2   FTR       11113 non-null  object
 3   HTHG      11113 non-null  float64
 4   HTAG      11113 non-null  float64
 5   HS        11113 non-null  float64
 6   AS        11113 non-null  float64
 7   HST       11113 non-null  float64
 8   AST       11113 non-null  float64
 9   HC        11113 non-null  float64
 10  AC        11113 non-null  float64
 11  HF        11113 non-null  float64
 12  AF        11113 non-null  float64
 13  HY        11113 non-null  float64
 14  AY        11113 non-null  float64
 15  HR        11113 non-null  float64
 16  AR        11113 non-null  float64
 17  HomeTeam  11113 non-null  float64
 18  AwayTeam  11113 non-null  float64
 19  Referee   11113 non-null  float64
 20  DateTime  11113 non-null  float64
 21  Season    11113 non-null  float64
 22  D         11113 non-null  uint8
 23  H         11113 non-null  uint8
dtypes: float64(19), int64(2), object(1), uint8(2)
memory usage: 1.9+ MB
# Check whether any columns are perfectly or highly correlated
plt.figure(figsize=(15,7))
# numeric_only=True skips the string-valued FTR column (required on pandas >= 2.0)
sns.heatmap(df2.corr(numeric_only=True),annot=True)
<AxesSubplot:>
df2.drop(['DateTime','Referee'],axis=1,inplace=True)
sns.pairplot(df2,hue='FTR')
<seaborn.axisgrid.PairGrid at 0x1cc1fc75330>
y=df2['FTR']
X=df2.drop(['FTR'],axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
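The split above is purely random. Since away wins are the minority class (955 of the 3334 test rows in the report further down), one option is to pass `stratify=y` so that train and test keep the same class proportions. This is a sketch on synthetic labels, not the notebook's data; `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 30% 'A' (away win), 70% 'NA'
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(["A"] * 30 + ["NA"] * 70)

# stratify keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101, stratify=y_toy
)

share_train = (y_tr == "A").mean()
share_test = (y_te == "A").mean()
print(share_train, share_test)
```

For match data, a chronological split (train on earlier seasons, test on later ones) may be closer to how the model would be used; that choice is left open here.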
from sklearn.svm import SVC
svc_model=SVC()
svc_model.fit(X_train,y_train)
SVC()
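`SVC()` uses an RBF kernel by default, which depends on distances between rows, so features on large scales can dominate features on small ones. In this notebook the inputs range from card counts below 1 to team codes up to 49. A common remedy, sketched here on synthetic data, is to standardise the features inside a pipeline so the scaler is fitted on the training rows only. All names and data in the block are illustrative.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two synthetic features on very different scales; only the first one carries signal
X_toy = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y_toy = np.where(X_toy[:, 0] > 0, "A", "NA")

# StandardScaler rescales each feature to zero mean and unit variance before the SVC
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_toy[:150], y_toy[:150])

acc = scaled_svc.score(X_toy[150:], y_toy[150:])
print(acc)
```

With the notebook's variables the corresponding calls would be `scaled_svc.fit(X_train, y_train)` and `scaled_svc.predict(X_test)`.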
predictions=svc_model.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
              precision    recall  f1-score   support
           A       0.99      0.74      0.85       955
          NA       0.91      1.00      0.95      2379
    accuracy                           0.92      3334
   macro avg       0.95      0.87      0.90      3334
weighted avg       0.93      0.92      0.92      3334
print(confusion_matrix(y_test,predictions))
[[ 707  248]
 [   7 2372]]
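The figures in the classification report follow directly from this confusion matrix. scikit-learn puts true labels on the rows and predicted labels on the columns, in sorted label order ('A', then 'NA'). The worked arithmetic below uses the four counts printed above; the majority-class baseline (always predicting 'NA') is added for comparison and is not part of the original output.

```python
import numpy as np

# Rows: true A, true NA. Columns: predicted A, predicted NA.
cm = np.array([[707, 248],
               [7, 2372]])

tp, fn = cm[0]  # away wins predicted as A / missed
fp, tn = cm[1]  # non-away results predicted as A / correctly predicted as NA

precision_a = tp / (tp + fp)     # 707 / 714
recall_a = tp / (tp + fn)        # 707 / 955
accuracy = (tp + tn) / cm.sum()  # 3079 / 3334
baseline = (fp + tn) / cm.sum()  # accuracy of always predicting NA: 2379 / 3334

print(round(precision_a, 2), round(recall_a, 2), round(accuracy, 2), round(baseline, 2))
# → 0.99 0.74 0.92 0.71
```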
predictions
array(['A', 'NA', 'NA', ..., 'NA', 'NA', 'A'], dtype=object)
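The model reaches 92% accuracy on the test set. For away wins, precision is 0.99 but recall is only 0.74, so about a quarter of away wins are missed. Because the features include the full-time goals (FTHG and FTAG), which determine the result, this score reflects information that is only available after the match.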
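One way to see how much the feature set already reveals about the target: `X` still contains `FTHG` and `FTAG`, and an away win is by definition `FTAG > FTHG`. The sketch below applies that rule to the five matches shown in `df.head()` earlier (scores and labels copied from that output) and recovers every label without any model.

```python
import pandas as pd

# The five matches from df.head(): full-time goals and the binarised result
toy = pd.DataFrame({
    "FTHG": [0, 4, 1, 2, 1],
    "FTAG": [3, 1, 2, 0, 1],
    "FTR": ["A", "NA", "A", "NA", "NA"],
})

# Rule with no learning involved: away win exactly when the away team scored more
rule_pred = (toy["FTAG"] > toy["FTHG"]).map({True: "A", False: "NA"})
rule_accuracy = (rule_pred == toy["FTR"]).mean()

print(rule_accuracy)
# → 1.0
```

A model meant to predict matches before kickoff would need to drop such in-match columns, for example with `X.drop(['FTHG','FTAG'],axis=1)`, and probably the half-time and match-statistics columns as well; which columns count as known in advance is an assumption to settle for the intended use.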